
Reinforcement Learning


What is Reinforcement Learning?

Reinforcement Learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and aims to learn a strategy (policy) that maximizes cumulative rewards over time.
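
As a minimal sketch of this interaction loop (a made-up two-action "bandit" setup, with all names and numbers chosen purely for illustration, not taken from any library), the agent repeatedly picks an action, receives a reward from the environment, and updates its estimate of how good each action is:

import random

# Hypothetical environment: action 1 pays off more often than action 0.
def environment_step(action):
    win_probability = 0.8 if action == 1 else 0.2
    return 1 if random.random() < win_probability else 0   # reward

action_values = [0.0, 0.0]   # the agent's running estimate of each action's value
epsilon = 0.1                # exploration rate
total_reward = 0

for step in range(1000):
    # Epsilon-greedy policy: mostly exploit the best-looking action, sometimes explore.
    if random.random() < epsilon:
        action = random.choice([0, 1])
    else:
        action = 0 if action_values[0] >= action_values[1] else 1
    reward = environment_step(action)
    total_reward += reward
    # Learn from the reward with a simple incremental update.
    action_values[action] += 0.1 * (reward - action_values[action])

print("Estimated action values:", action_values)
print("Cumulative reward:", total_reward)

After enough steps the estimates favor action 1, so the greedy choice maximizes cumulative reward, which is exactly the behavior described above.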

Types of Reinforcement Learning

The four types are compared below along the same dimensions: what each one is, when it is used, when it is preferred over the other types, when it is not recommended, and example projects where it fits best.

Model-Free
What it is: Model-Free Reinforcement Learning is a type of RL where the agent learns directly from experience (rewards and actions) without building a model of the environment's dynamics. Examples: Q-Learning, SARSA, Policy Gradients.
When it is used: When the environment is complex or unknown and it is hard to define how actions change states (no transition model available).
When it is preferred over other types:
• Better than Model-Based when the environment is unpredictable or too hard to simulate.
• Better than Inverse RL when reward signals are clearly defined.
• Better than Multi-Agent RL when only one agent acts in the environment.
When it is not recommended:
• When environment simulation is cheap or available (Model-Based would be faster).
• When rewards are sparse or delayed (learning becomes slow).
• When you need interpretability or planning ahead.
Example projects:
• Autonomous game-playing agent (e.g., Atari, CartPole) using Q-Learning or DQN.
• Robot navigation where the robot learns by trial and error.
• Stock trading simulation where the agent learns profitable actions from rewards.

Model-Based
What it is: Model-Based Reinforcement Learning builds an internal model of the environment (predicting the next state and reward) and uses it to plan optimal actions before executing them.
When it is used: When the environment's dynamics can be estimated or simulated and you want faster learning with fewer real-world interactions.
When it is preferred over other types:
• Better than Model-Free when interactions are expensive (e.g., in robotics).
• Better than Multi-Agent RL when there is only one agent and full control.
• Better than Inverse RL when rewards are known and well-defined.
When it is not recommended:
• When the environment is highly stochastic or too complex to model accurately.
• When the learned model is unreliable (leads to compounding errors).
• When computation or storage for the model is limited.
Example projects:
• Simulated robot arm control where the model predicts movement results before acting.
• Autonomous driving simulators that plan ahead using a learned environment model.
• Game AI that simulates possible moves before choosing the best one (like chess planning).

Multi-Agent RL
What it is: Multi-Agent Reinforcement Learning (MARL) involves multiple agents interacting within the same environment, each learning its own policy while considering the others' actions.
When it is used: When tasks require cooperation, competition, or coordination between multiple decision-makers (agents).
When it is preferred over other types:
• Better than Model-Free or Model-Based when multiple agents must interact or share a space.
• Better than Inverse RL when rewards are known and depend on multi-agent outcomes.
• Ideal for environments with teamwork or competition (e.g., games, swarm control).
When it is not recommended:
• When only one agent exists (simpler RL methods suffice).
• When agent interactions cause instability or non-stationary learning.
• When training cost is too high due to the number of agent combinations.
Example projects (a minimal two-agent sketch follows this table):
• Autonomous drone swarms coordinating to cover an area efficiently.
• Multi-player game AI where agents learn cooperative or competitive strategies.
• Traffic light control systems with multiple agents optimizing flow jointly.

Inverse RL
What it is: Inverse Reinforcement Learning (IRL) aims to learn the reward function behind an expert's behavior instead of learning the policy directly. The agent observes demonstrations and infers what motivates them.
When it is used: When the reward function is unknown or hard to define but expert demonstrations are available.
When it is preferred over other types:
• Better than Model-Free or Model-Based when it is easier to collect expert examples than to handcraft a reward.
• Better than Multi-Agent RL when only one expert agent exists.
• Ideal for imitation or human-behavior modeling.
When it is not recommended:
• When rewards are already known (standard RL is simpler).
• When expert data is noisy, biased, or limited.
• When computational cost matters (IRL is usually expensive).
Example projects:
• Autonomous driving imitation: learning driving rules from human drivers.
• Robotic manipulation: a robot learns tasks by observing human demonstrations.
• Healthcare treatment planning: learning doctors' decision patterns from patient data.
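
The Multi-Agent type has no dedicated code sample below, so here is a minimal sketch of it (referenced in the table above): two independent Q-learners repeatedly play a small coordination game in which both are rewarded only when their actions match. The game, variable names, and hyperparameters are illustrative assumptions, not a standard benchmark; each agent treats the other as part of the environment, which is exactly the non-stationarity issue mentioned in the table.

import numpy as np
import random

n_actions = 2
q_values = [np.zeros(n_actions), np.zeros(n_actions)]  # one action-value table per agent
alpha, epsilon = 0.1, 0.2

def choose(q):
    # Epsilon-greedy action selection for one agent.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(q))

for episode in range(2000):
    actions = [choose(q_values[0]), choose(q_values[1])]
    reward = 1 if actions[0] == actions[1] else 0   # shared team reward
    for i in range(2):
        # Independent update: each agent learns from its own action and the shared reward.
        q_values[i][actions[i]] += alpha * (reward - q_values[i][actions[i]])

print("Agent 0 action values:", q_values[0])
print("Agent 1 action values:", q_values[1])

With enough episodes the two agents usually settle on the same action, illustrating how coordination can emerge even though neither agent observes the other's policy.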

Code 1 (Q-Learning)

A tabular, model-free example: the agent learns by trial and error to walk a 6-state chain from state 0 to the goal state, using the Q-Learning update rule.


import numpy as np
import random

# A tiny deterministic chain environment: states 0..5, the last state is the goal.
n_states = 6
n_actions = 2            # 0 = move left, 1 = move right
q_table = np.zeros((n_states, n_actions))
gamma = 0.8              # discount factor
alpha = 0.1              # learning rate
epsilon = 0.2            # exploration rate for epsilon-greedy action selection
episodes = 50

def choose_action(state):
    # Epsilon-greedy: explore a random action with probability epsilon,
    # otherwise exploit the action with the highest current Q-value.
    if random.uniform(0,1) < epsilon:
        return random.choice([0,1])
    else:
        return np.argmax(q_table[state])

def get_next_state_reward(state, action):
    # Action 0 moves left, action 1 moves right; the only reward is +10
    # for arriving at the final (goal) state.
    if action == 0:
        next_state = max(0, state-1)
    else:
        next_state = min(n_states-1, state+1)
    reward = 10 if next_state == n_states-1 else 0
    return next_state, reward

# Training loop: every episode starts at state 0 and ends when the goal is reached.
for ep in range(episodes):
    state = 0
    done = False
    while not done:
        action = choose_action(state)
        next_state, reward = get_next_state_reward(state, action)
        # Q-learning update: move the estimate toward reward + discounted best future value.
        q_table[state, action] = q_table[state, action] + alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
        if state == n_states-1:
            done = True

print("Learned Q-Table:")
print(q_table)

# Follow the greedy policy (no exploration) from the start state to the goal.
state = 0
steps = [state]
while state != n_states-1:
    action = np.argmax(q_table[state])
    state, _ = get_next_state_reward(state, action)
    steps.append(state)

print("Path taken by learned policy:", steps)

Code 2 (Policy Gradient)

A second model-free example: a small policy network is trained with the REINFORCE policy-gradient method, weighting the log-probabilities of the actions taken by the discounted returns of each episode.


import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

n_states = 5
n_actions = 2
gamma = 0.99   # discount factor used when computing returns

class PolicyNet(nn.Module):
    # A single linear layer mapping a one-hot state vector to action probabilities.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(n_states, n_actions)
    def forward(self, x):
        return torch.softmax(self.fc(x), dim=-1)

policy = PolicyNet()
optimizer = optim.Adam(policy.parameters(), lr=0.01)

def select_action(state):
    # Sample an action from the policy and return it together with its probability
    # (kept as a tensor so gradients can flow through log(prob) during the update).
    state_tensor = torch.FloatTensor(state).unsqueeze(0)
    probs = policy(state_tensor)
    action = torch.multinomial(probs, 1).item()
    return action, probs[0, action]

def get_next_state_reward(state, action):
    # Deterministic 5-state chain: action 1 moves right, action 0 moves left.
    # Reaching the last state gives reward 1 and ends the episode.
    next_state = min(max(0, state + (1 if action==1 else -1)), n_states-1)
    reward = 1 if next_state == n_states-1 else 0
    done = next_state == n_states-1
    return next_state, reward, done

def one_hot(state):
    vec = np.zeros(n_states)
    vec[state] = 1
    return vec

# REINFORCE: run one full episode, then update the policy using discounted returns.
for episode in range(200):
    state = 0
    rewards = []
    log_probs = []
    done = False
    while not done:
        state_vec = one_hot(state)
        action, prob = select_action(state_vec)
        log_probs.append(torch.log(prob))
        state, reward, done = get_next_state_reward(state, action)
        rewards.append(reward)
    # Compute the discounted return for every step, working backwards through the episode.
    discounted_rewards = []
    R = 0
    for r in reversed(rewards):
        R = r + gamma * R
        discounted_rewards.insert(0, R)
    discounted_rewards = torch.FloatTensor(discounted_rewards)
    # Policy-gradient loss: negative log-probabilities weighted by the returns.
    loss = -torch.sum(torch.stack(log_probs) * discounted_rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("Training finished!")

state = 0
steps = [state]
while state != n_states-1:
    state_vec = one_hot(state)
    action, _ = select_action(state_vec)
    state, _, _ = get_next_state_reward(state, action)
    steps.append(state)

print("Path taken by learned policy:", steps)

Code 3 (MCTS-based RL)

A planning-flavored (model-based) example: at every environment step the agent runs a simplified, one-level Monte Carlo Tree Search over the two actions, using the environment function as a simulator for random rollouts and UCB1 to balance exploration and exploitation.


import math
import random

n_states = 5
n_actions = 2   # 0 = move left, 1 = move right
goal_state = 4  # reaching this state gives reward 1 and ends the episode

def get_next_state_reward(state, action):
    next_state = min(max(0, state + (1 if action == 1 else -1)), n_states-1)
    reward = 1 if next_state == goal_state else 0
    done = next_state == goal_state
    return next_state, reward, done

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.value = 0.0

def ucb1(node, child):
    # UCB1 score: average value plus an exploration bonus; unvisited children come first.
    if child.visits == 0:
        return float('inf')
    return child.value / child.visits + math.sqrt(2 * math.log(node.visits) / child.visits)

def tree_policy(node):
    # Simplified selection/expansion: expand any untried action of this node first;
    # once both actions have children, pick the child with the best UCB1 score.
    # (The search tree stays one level deep here, so this behaves like a UCB bandit
    # over the root's actions combined with random rollouts.)
    for action in range(n_actions):
        if action not in node.children:
            next_state, _, _ = get_next_state_reward(node.state, action)
            node.children[action] = Node(next_state, node)
            return node.children[action]
    ucb_values = [ucb1(node, child) for child in node.children.values()]
    best_action = list(node.children.keys())[ucb_values.index(max(ucb_values))]
    return node.children[best_action]

def default_policy(state):
    # Simulation phase: random rollout of at most 10 steps, returning the reward collected.
    total_reward = 0
    steps = 0
    done = False
    while not done and steps < 10:
        action = random.choice([0,1])
        state, reward, done = get_next_state_reward(state, action)
        total_reward += reward
        steps += 1
    return total_reward

def backup(node, reward):
    # Backpropagation phase: update visit counts and values from this node up to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def mcts(root, iterations=50):
    for _ in range(iterations):
        node = tree_policy(root)               # selection / expansion
        reward = default_policy(node.state)    # simulation (random rollout)
        backup(node, reward)                   # backpropagation
    # Choose the root action whose child was visited most often.
    action_visits = {action: child.visits for action, child in root.children.items()}
    best_action = max(action_visits, key=action_visits.get)
    return best_action

# At every step, build a fresh search tree from the current state and act on its result.
state = 0
steps = [state]
while state != goal_state:
    root = Node(state)
    action = mcts(root, iterations=50)
    state, _, _ = get_next_state_reward(state, action)
    steps.append(state)

print("Path found by MCTS-based agent:", steps)

Code 4 (Adversarial Inverse RL)

An adversarial imitation example in the spirit of inverse RL: a discriminator learns to tell expert transitions from agent transitions, and its output is used as a learned reward signal for training the policy (a simplified GAIL/AIRL-style setup).


import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random

n_states = 5
n_actions = 2
goal_state = 4

def get_next_state(state, action):
    # Same 5-state chain as before: action 1 moves right, action 0 moves left.
    next_state = min(max(0, state + (1 if action == 1 else -1)), n_states-1)
    return next_state

expert_demo = [0,1,2,3,4]   # expert demonstration: always move right until the goal

def one_hot(state):
    vec = np.zeros(n_states)
    vec[state] = 1
    return vec

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(n_states, n_actions)
    def forward(self, x):
        return torch.softmax(self.fc(x), dim=-1)

class Discriminator(nn.Module):
    # Classifies (state, action) pairs: output near 1 = expert-like, near 0 = agent-generated.
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_states + n_actions, 16),
            nn.ReLU(),
            nn.Linear(16,1),
            nn.Sigmoid()
        )
    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return self.fc(x)

policy = PolicyNet()
discriminator = Discriminator()
policy_opt = optim.Adam(policy.parameters(), lr=0.01)
disc_opt = optim.Adam(discriminator.parameters(), lr=0.01)
loss_fn = nn.BCELoss()

for epoch in range(200):
    # Roll out one episode with the current policy, recording states, one-hot actions,
    # and the log-probabilities of the sampled actions (needed for the policy update).
    state = 0
    agent_states = []
    agent_actions = []
    agent_log_probs = []
    while state != goal_state:
        state_vec = torch.FloatTensor(one_hot(state)).unsqueeze(0)
        probs = policy(state_vec)
        action = torch.multinomial(probs, 1).item()
        action_vec = torch.zeros((1,n_actions))
        action_vec[0, action] = 1
        agent_states.append(state_vec)
        agent_actions.append(action_vec)
        agent_log_probs.append(torch.log(probs[0, action] + 1e-8))
        state = get_next_state(state, action)
    
    # Expert data: every demonstrated transition uses action 1 (move right).
    expert_states = [torch.FloatTensor(one_hot(s)).unsqueeze(0) for s in expert_demo[:-1]]
    expert_actions = []
    for i in range(len(expert_demo)-1):
        act_vec = torch.zeros((1,n_actions))
        act_vec[0,1] = 1
        expert_actions.append(act_vec)
    
    # Discriminator update: label agent transitions 0 and expert transitions 1.
    disc_opt.zero_grad()
    agent_preds = torch.cat([discriminator(s,a) for s,a in zip(agent_states, agent_actions)])
    expert_preds = torch.cat([discriminator(s,a) for s,a in zip(expert_states, expert_actions)])
    
    agent_labels = torch.zeros_like(agent_preds)
    expert_labels = torch.ones_like(expert_preds)
    
    loss = loss_fn(agent_preds, agent_labels) + loss_fn(expert_preds, expert_labels)
    loss.backward()
    disc_opt.step()
    
    # Policy update (REINFORCE-style): the discriminator's output acts as a learned
    # reward. The log-probabilities recorded during the rollout give the gradient a
    # path to the policy parameters; the discriminator is kept fixed (no_grad) here.
    policy_opt.zero_grad()
    with torch.no_grad():
        rewards = torch.cat([discriminator(s,a) for s,a in zip(agent_states, agent_actions)]).squeeze(-1)
    log_probs = torch.stack(agent_log_probs)
    policy_loss = -torch.sum(log_probs * torch.log(rewards + 1e-8))
    policy_loss.backward()
    policy_opt.step()

# Roll out the trained policy (actions are still sampled from its probabilities).
state = 0
steps = [state]
while state != goal_state:
    state_vec = torch.FloatTensor(one_hot(state)).unsqueeze(0)
    probs = policy(state_vec)
    action = torch.multinomial(probs,1).item()
    state = get_next_state(state, action)
    steps.append(state)

print("Path learned by AIRL agent:", steps)